
CHAPTER 9. CONVOLUTIONAL NETWORKS
nicknamed “grandmother cells”—the idea is that a person could have a neuron that
activates when seeing an image of their grandmother, regardless of whether she
appears in the left or right side of the image, whether the image is a close-up of
her face or zoomed out shot of her entire body, whether she is brightly lit, or in
shadow, etc.
These grandmother cells have been shown to actually exist in the human brain,
in a region called the medial temporal lobe (Quiroga et al., 2005). Researchers
tested whether individual neurons would respond to photos of famous individuals.
They found what has come to be called the “Halle Berry neuron”: an individual
neuron that is activated by the concept of Halle Berry. This neuron fires when a
person sees a photo of Halle Berry, a drawing of Halle Berry, or even text containing
the words “Halle Berry.” Of course, this has nothing to do with Halle Berry herself;
other neurons responded to the presence of Bill Clinton, Jennifer Aniston, etc.
These medial temporal lobe neurons are somewhat more general than modern
convolutional networks, which would not automatically generalize to identifying
a person or object when reading its name. The closest analog to a convolutional
network’s last layer of features is a brain area called the inferotemporal cortex
(IT). When viewing an object, information flows from the retina, through the
LGN, to V1, then onward to V2, then V4, then IT. This happens within the first
100ms of glimpsing an object. If a person is allowed to continue looking at the
object for more time, then information will begin to flow backwards as the brain
uses top-down feedback to update the activations in the lower level brain areas.
However, if we interrupt the person’s gaze, and observe only the firing rates that
result from the first 100ms of mostly feedforward activation, then IT proves to be
very similar to a convolutional network. Convolutional networks can predict IT
firing rates, and also perform very similarly to (time limited) humans on object
recognition tasks (DiCarlo, 2013).
That being said, there are many differences between convolutional networks
and the mammalian vision system. Some of these differences are well known
to computational neuroscientists, but outside the scope of this book. Some of
these differences are not yet known, because many basic questions about how the
mammalian vision system works remain unanswered. As a brief list:
•
The human eye is mostly very low resolution, except for a tiny patch called the
fovea
. The fovea only observes an area about the size of a thumbnail held at
arms length. Though we feel as if we can see an entire scene in high resolution,
this is an illusion created by the subconscious part of our brain, as it stitches
together several glimpses of small areas. Most convolutional networks actually
receive large full resolution photographs as input. The human brain makes
366